Abusing the internet with popular search engine technologies
                                    by c0ntex | c0ntexb[at]gmail.com
                ------------------------------------------------------------------------------------------

When you think of a search engine, you probably conjure up an image of the handy, HTML-based site that can
help you find those RFCs, wiring diagrams, soft-warez, serial keys or questionable images, depending on your
taste. You've probably used generic search engines like Google, Yahoo, MSN, Lycos & Hotbot a trillion times
without even considering the possibility that they could be used to support an attack attempt.

Due to the sheer volume of HTTP-based attacks, one has to consider that anything utilising HTTP is a serious
theoretical attack vector. Too many people feel happy in the knowledge that only HTTP holes are punched in
the firewall: if the server daemon is patched and user input locations are verified as safe, then there is no
need to worry. This is a serious misconception that one should avoid.

HTTP fingerprinting, SQL injection, malformed header injections, XSS, cookie fun: every conceivable form of
web-based application or service attack can be managed by your security infrastructure. Yet what is the use
of having any of this preventative security to protect your data from prying eyes when you have allowed
search bots to come in happily and crawl all of your web servers?


The Berkeley teaching guide describes spiders as:
"Computer robot programs, referred to sometimes as "crawlers" or "knowledge-bots" or "knowbots" that are used
by search engines to roam the World Wide Web via the Internet, visit sites and databases, and keep the search
engine database of web pages up to date. They obtain new pages, update known pages, and delete obsolete ones.
Their findings are then integrated into the "home" database."


When any powerful technology is utilised by a curious mind, possessing advanced knowledge of the internal
workings, it becomes easy to transform the technology into a potentially malicious weapon. In this case, the
engine soon becomes an advanced data-gathering tool for attackers. Everyone knows about RFP's web mining tool
called Whisker, yet not everyone is aware that a search engine can do the same...and more.

Google is the most powerful search engine online, and probably has been for the last four years or more,
crawling literally millions of websites and public archives daily. The amount of data that it contains would
crush any reasonable human mind due to the sheer scale. With data stored on such a vast scale, and gathered
automatically by the engines, it is no wonder that sensitive information will sometimes be included.

The spiders that Google uses to harvest all these sites have no mind; they are merely an axon connection to a
central soma, acting like a neural network. The data is filtered down the axon if it passes a simple logic
test, and the soma then sends it into the central brain. The only energy required to sustain crawling is a diet
similar to that of the pecking pigeons which perform the ranking inside the database clusters.

http://www.google.com/technology/pigeonrank.html

When testing a company site or domain, it is always useful to use Google, as its crawling software is superb
and will find some pretty interesting stuff you might otherwise overlook. It also has a very nice cache
function that will allow you to see "old" pages that were once available from that site. Attackers can use
this feature to compare pages, documents, structures and the like.
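
The comparison itself is nothing exotic. As a rough sketch, assuming you have already saved the text of an old
(cached) copy and the current (live) copy of a page yourself, Python's standard difflib module will show what
has been added or removed. The function name diff_pages and the sample page content are made up for
illustration only.

```python
import difflib

def diff_pages(old_text, new_text):
    """Return unified-diff lines between two saved versions of a page."""
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="cached",
        tofile="live",
        lineterm="",
    ))

# Hypothetical sample data: the cached copy still links to a file
# that has since been removed from the live page.
cached = "<a href='/index.html'>Home</a>\n<a href='/admin/backup.zip'>Backup</a>\n"
live = "<a href='/index.html'>Home</a>\n"

for line in diff_pages(cached, live):
    print(line)
```

Lines starting with "-" exist only in the old copy, which is exactly the sort of thing a site owner believed
was gone.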


Picture a fictional public BBS sitting in your DMZ, using flat text and CGI. The BBS contains a plain text or
MD5-hashed password file that sits in the web directory, and Mr Security forgets about it. One week later, a
search engine spider creeps along, crawls your site and just happens to find the password text. One month later
we come along and search your domain for password.txt, and guess what pops up.

Interestingly, it seems a user from the internal network has subscribed to the BBS, using the same password as
his domain password. Now things become more interesting, especially when you find out after a very authentic
sounding phone call to his office receptionist, that the user is a member of IT staff.
                                                                                                                                     
        IT: domains, routers, switches, servers, databases.
        Human: laziness, imperfection, same password everywhere, domain admin account?
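
Finding the forgotten file before the spider does is the cheaper option. The following is a minimal sketch of
such a check, not a complete audit tool: it walks a web root and lists files whose names look sensitive. The
name fragments, the extensions and the function name find_exposed_files are all assumptions chosen for
illustration, so tune them for your own site.

```python
import os

# Assumed patterns, for illustration only.
SUSPECT_FRAGMENTS = ("passwd", "password")
SUSPECT_EXTENSIONS = (".mdb", ".bak")

def find_exposed_files(web_root):
    """Yield paths under web_root whose file names look sensitive."""
    for dirpath, _dirnames, filenames in os.walk(web_root):
        for name in filenames:
            lowered = name.lower()
            if (any(fragment in lowered for fragment in SUSPECT_FRAGMENTS)
                    or lowered.endswith(SUSPECT_EXTENSIONS)):
                yield os.path.join(dirpath, name)

# Example call, assuming the web root lives at /var/www:
#     for path in find_exposed_files("/var/www"):
#         print(path)
```

Anything this prints is something a crawler could also reach, provided a page or directory listing links to it.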


It's possible to block spiders by using a robots.txt file, provided the spider can read. An example robots.txt
file looks like:


	User-agent: *   # This shoos all spiders off with a big
	Disallow: /     # stick, no flies around this server


	User-agent: googlebot # Fend off googlebot:
	Disallow: /cgi-bin/
	Disallow: /images/


	# Tease vulnerable spiders:
	User-agent: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA x 350
	Disallow: /GoogleBot PAYLOAD :-)/
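
Before trusting a robots file, it is worth checking how a well-behaved parser reads it. The sketch below feeds
the first two rule sets above to Python's standard urllib.robotparser module and asks whether a given agent may
fetch a given path. The helper name allowed and the test paths are illustrative assumptions; a real spider may
of course use a different parser, or ignore the file entirely.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, user_agent, path):
    """Return True if user_agent may fetch path under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

# The first two examples from the text, treated as separate robots.txt files.
BLOCK_ALL = """
User-agent: *
Disallow: /
"""

GOOGLEBOT_ONLY = """
User-agent: googlebot
Disallow: /cgi-bin/
Disallow: /images/
"""

for agent, path in (("googlebot", "/index.html"),
                    ("googlebot", "/cgi-bin/status.cgi"),
                    ("otherbot", "/cgi-bin/status.cgi")):
    print(agent, path,
          "block-all:", allowed(BLOCK_ALL, agent, path),
          "googlebot-only:", allowed(GOOGLEBOT_ONLY, agent, path))
```

Note that the second rule set only speaks to googlebot: any other spider is left free to crawl /cgi-bin/.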


If you really suffer from arachnophobia, you could block the crawler bot parasite at your router / firewall.

NOTE: This text is not trying to encourage the use of Google as an attack tool; acting on information found could be
illegal. Yet it is worth considering the ramifications of what data you allow your users to place in their private web
directories.

Example usage of the Google advanced search features:


	allinurl:						# All of the given strings must appear in the URL
	allintitle:						# All of the given strings must appear in the title
	filetype:						# Restrict results to a specific file type
	intitle:						# The given string must appear in the title
	inurl:							# The given string must appear in the URL
	link:							# Find pages that link to the given URL
	site:							# Restrict results to the given site or domain

Specific:
	allinurl:session_login.cgi			# Nagios/NetSaint/Nagmin
	allintitle:/cgi-bin/status.cgi?			# Network monitoring
	allintitle:/cgi-bin/extinfo.cgi			# Nagios/NetSaint

	inurl:cpqlogin.htm				# Compaq Insight Agent
	allinurl:"Index of /WEBAGENT/"		# Compaq Insight Agent
	allinurl:/proxy/ssllogin				# Compaq Insight Agent
	allintitle:WBEM Login				# Compaq Insight Agent
	WBEM site:domain.com		 	# Compaq Insight Agent at domain.com

	inurl:vslogin OR vsloginpage			# Visitor System or Vitalnet

	allintitle:/exchange/root.asp			# Exchange Webmail
	"Index of /exchange/"				# Exchange Webmail Directory

	netopia intitle:192.168				# Netopia Router Config


General:
	site:blah.com filetype:cgi			# Check cgi source code

	allinurl:.gov passwd				# A blackhat favorite
	allinurl:.mil passwd				# Another blackhat favorite

	"Index of /admin/"
	"Index of /cgi-bin/"
	"Index of /mail/"
	"Index of /passwd/" 
	"Index of /private/"
	"Index of /proxy/" 
	"Index of /pls/"
	"Index of /scripts/"
	"Index of /tsc/"
	"Index of /www/"

	"Index of" config.php OR config.cgi

	intitle:"index.of /ftp/etc" passwd OR pass -freebsd -netbsd -openbsd
	intitle:"index.of" passwd -freebsd -netbsd -openbsd

	inurl:"Index of /backup"
	inurl:"Index of /tmp"
	inurl:auth filetype:mdb                         # MDB files
	inurl:users filetype:mdb
	inurl:config filetype:mdb

	inurl:clients filetype:xls			# General spreadsheets

	inurl:network filetype:vsd			# Network diagrams
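
If you want to run a handful of these checks against your own domain, prefixing each pattern with site: keeps
the results scoped to servers you are responsible for. The sketch below only builds the query strings and the
matching search URLs; it does not send any requests. The pattern selection, the function names and the use of
the q parameter on https://www.google.com/search are assumptions made for illustration.

```python
from urllib.parse import urlencode

# A few patterns taken from the lists above.
AUDIT_PATTERNS = [
    '"Index of /admin/"',
    '"Index of /cgi-bin/"',
    'inurl:users filetype:mdb',
    'inurl:clients filetype:xls',
    'inurl:network filetype:vsd',
]

def site_queries(domain, patterns=AUDIT_PATTERNS):
    """Return each pattern restricted to the given domain with site:."""
    return [f"site:{domain} {pattern}" for pattern in patterns]

def search_url(query, base="https://www.google.com/search"):
    """Return a search URL with the query URL-encoded into the q parameter."""
    return base + "?" + urlencode({"q": query})

# example.com stands in for a domain you own.
for query in site_queries("example.com"):
    print(query)
    print("    " + search_url(query))
```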


EOF

# milw0rm.com [2006-04-08]